self

\(Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V\)

Causal Masking:

Scalability:

Multi-head:

Parameter calculation: